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SUMMARY 


We present a variant of the solver in |Zepeda-Nunez and De-| 
|manet| { [2014| ), for the 2D high-frequency Helmholtz equation 
in heterogeneous acoustic media. By changing the domain de¬ 
composition from a layered to a grid-like partition, this variant 
yields improved asymptotic online and offline runtimes and a 
lower memory footprint. The solver has online parallel com¬ 
plexity that scales sublinearly as ^ (^), where N is the num¬ 
ber of volume unknowns, and P is the numbe r of processors. 


provided that P = The variant in |Zepeda-Nunez 


and Demanet 


(20141 only afforded P = Algorith- 


mic scalability is a prime requirement for wave simulation in 
regimes of interest for geophysical imaging. 


INTRODUCTION 

Define the Helmholtz equation, in a bounded domain C 
with frequency ft) and squared slowness m(x) = l/c^(x), by 

(^-A-m(x)co^)M(x) =/(x) (1) 

with absorbing boundary conditions. In this note, Eq is 
discretized with a 5-point stencil, and the absorbing bound¬ 
ary conditions are implemented via a perfectly matched layer 
(PML) as described by |Berenger| ( |1994| ). This leads to a lin¬ 
ear system of the form Hu = f. Let N be the total number of 
unknowns of the linear system and n = the number of 
points per dimension. In addition, suppose that the number of 
point within the PML grows as log (A). There is an important 
distinction between: 

• the offline stage, which consists of any precomputation 
involving H, but not f; and 

• the online stage, which involves solving Hu = f for 
many different right-hand sides f. 


Let P for the number of nodes in a distributed memory envi¬ 
ronment. In |Zepeda-Nunez and Demanet| ( |2014) , each layer 
is associated with one node (P = L), and the method’s online 
runtime is ^(N/P) as long as P = The method pre¬ 

sented in this note instead uses a nested domain decomposi¬ 
tion, with a first decomposition into L ^ \/P layers, and a fur¬ 
ther decomposition of each layer into Lc ^ a/P cells, as shown 
in Fig. The resulting online runtime is now ^(N/P) as long 
asP= (^(^1/5). 

The main drawback of the method of polarized traces in |Zepeda| - 
|Nunez and Demanet] ( |2014| ) is its offline precomputation that 
involves computing and storing interface-to-interface local Green’s 
functions. In 3D this approach would become impractical given 
the sheer size of the resulting matrices. To alleviate this is¬ 
sue, the nested-sweep polarized traces method presented in this 
paper involves an equivalent matrix-free approach that relies 
on local solves with sources at the interfaces between layers. 
Given the iterative nature of the preconditioner, solving the lo¬ 
cal problems naively would incur a deterioration of the online 
complexity. This deterioration can be circumvented if we solve 
the local problems inside the layer via the same boundary in¬ 
tegral strategy as earlier, in a nested fashion. This procedure 
can be written as a factorization of Green’s integrals in block- 
sparse factors. 
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By online complexity, we mean the runtime for solving the 
system once in the online stage. The distinction is important 
in situations like geophysical wave propagation, where offline 
precomputations are often amortized over the large number of 
system solves with the same matrix H. 

|Zepeda-Nunez and Demanet| ( |2014| l proposed a hybrid between 
a direct and an iterative method, in which the domain is decom¬ 
posed in L layers. Using the Green’s representation formula to 
couple the subdomains, the problem is reduced to a boundary 
integral system at the layer interfaces, solved iteratively, and 
which can be preconditioned efficiently using a polarizing con¬ 
dition in the form of incomplete Green’s integrals. Finally, the 
integral system and the preconditioner are compressed, yield¬ 
ing a fast application. 


Figure 1: Nested Decomposition in cells. The orange grid- 
points represent the PML for the original problem, the light- 
blue represent the artificial PML between layers, and the pink 
grid-points represent the artificial PML between cells in the 
same layer. 

Finally, the offline complexity is much reduced; instead of 
computing large Green’s functions for each layer, we compute 
much smaller interface-to-interface operators between the in¬ 
terfaces of adjacent cells within each layer, resulting in a lower 
memory requirement. 
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Classical iterative methods require a large number of iterations 
to converge and standard algebraic preconditioners often fail 
to improve the convergence rate to tolerable levels (see |Er-| 
|langga| ( p008| )). Classical domain decomposition methods also 
fail because of internal reverberations (see [Ernst and Gander] 
\20\2\ ). Only recently, the development of a new type of pre- 
condtioners have shown that domain decomposition with accu¬ 
rate transmission boundary conditions is the right mix of ideas 
for simulating propagating high-frequency waves in a hetero¬ 
geneous medium. To a great extent, the approach can be traced 
back to the AILU method of [Gander and Nataf| ( |2000| ). The 
first linear complexity claim was perhaps made in the work of 
[Engquist and Ying| ( [2011[ ), followed by [Stolk[ ( [2013[ ). Other 
authors have since then proposed related methods with simi¬ 
lar properties, including [Chen and Xiang[p013[ ) and [Vion and[ 
[Geuzaine] ( [2014[ ); however, they are often difficult to paral¬ 
lelize, and they usually rely on distributed linear algebra such 
s [Poulson et al.[ ( [2013] l and [Tsuji et al. ([2014[), o r highly tuned 


multigrid methods such as Stolk et ai.[p013[ ) and [Calandra 


et al. 


Ying 


(20131. As we put the final touches to this note, Liu and 


( 2015 1 proposed a recursive sweeping algorithm closely 


related to the one presented in this note. 


In a different direction, much progress has been made on mak¬ 
ing direct methods efficient for the Helmholtz equation. Such 
is the case the HSS solver in[de Hoop et al.[ ( [2011[ ) and hierar¬ 
chical solver in |Gillman et al.||2014[ ). Themain issue with di¬ 
rect methods is the lack of scalability to very large-scale prob¬ 
lems due to the memory requirements. 

To our knowledge, [Zepeda-Nunez and Demanet] ( [2014[ ) and 
this note are the only two references to date that make claims of 
sublinear 0{N/P) online runtime for the 2D Helmholtz equa¬ 
tion. We are unaware that this claim has been made in 3D yet, 
but we are confident that the same ideas have an important role 
to play there as well. 


METHODS 
Polarized Traces 

This section reviews the method in [Zepeda-Nunez and De-[ 
[manet[ ( [2014[ ). Provided that the domain is decomposed in L 
layers, it is possible to reduce the discrete Helmholtz equa¬ 
tion to a discrete integral system at the interfaces between lay¬ 
ers, using the Green’s representation formula. This leads to an 
equivalent discrete integral system of the form 

Mu = f. (2) 

The online stage has three parts: first, at each layer, a local 
solve is performed to build f in Eq. then Eq. |^is solved ob¬ 
taining the traces u of the solution at the interfaces betwen lay¬ 
ers; finally, using the traces, a local solve is performed at each 
layer to obtain the solution u in the volume. The only non¬ 
parallel step is solving Eq. Fortunately, there is a physically 
meaningful way of efficiently preconditioning this system. 

The key is to write yet another equivalent problem where local 
Green’s functions are used to perform polarization into one¬ 
way components. A wave is said to be polarized as up-going 


at an interface L when 

0 = y* dy^,G{x,x')u^{x')dx — J G{x,x')dy^,u^{x')dx , 

as long as x is below L, and vice-versa for down-going waves. 
These polarization conditions create cancellations within the 
discrete integral system, resulting in an easily preconditionable 
system for the polarized interface traces and such that 
u = -hu^. The resulting polarized system is 

m=t s=(^)' (3) 


The matrix M has the structure depicted in Fig. In particular, 
we write 



y 

Dt 


(4) 


where and are, respectively, block lower and upper 
triangular matrices with identity diagonal blocks, thus trivial 
to invert via block backsubstitution. In this note, we use the 
Gauss-Seidel Preconditioner, 


pGS 


yf 

vt 


(D^) 

(v^ — L(D^)“^v^) 


(5) 


The system in Eq. [^is solved using GMRES preconditioned 
with 



Figure 2: Sparsity pattern of M (left) and M (right). 

Moreover, one can use an adaptive partitioned low-rank (PLR, 
also known as ^-matrix) fast algorithm for the application of 
integral kernels, in expressions such as the one above. The im¬ 
plementation details are in [Zepeda-Nunez and Demanet| ( [2014[ l. 

Nested Solvers 

This section now describes the new solver. 

One key observation is that the polarized matrix M can be ap¬ 
plied to a vector in a matrix-free fashion, addressing the of¬ 
fline bottleneck and memory footprint. Each block of M is a 
Green’s integral, and its application to a vector is equivalent 
to sampling a wavefield generated by sources at the bound¬ 
aries. The application of each Green’s integral to a vector v, 
in matrix-free form, consists of three steps: from v we form 
the sources at the interfaces (red interfaces in Fig. j^left), we 
perform a local direct solve inside the layer, and we sample the 
solution at the interfaces (green interfaces in Fig. [fright). 

From the analysis of the rank of off-diagonal blocks of the 
Green’s functions, we know that the Green’s integrals can be 
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compressed and applied in time, but this requires pre- 

computation and storage of the Green’s functions. The matrix- 
free approach does not need expensive precomputations, but 
would naively perform a direct local solve in the volume, re¬ 
sulting in an application of each Green’s integral in /L) 
complexity (assuming that a good direct method is used at each 
layer). This note’s contribution is to propose “nested” alterna¬ 
tives for the inner solves that result in a lower ^{Lc(n/L)^f^) 
complexity (up to logs) for the application of each Green’s in¬ 
tegral. 

Analogously to Eq. denote the boundary integral reduction 
of the local problem within each layer as 

MV = f^, for£=l,.,Lc; (6) 

where the unknowns are on the interfaces between the Lc 
cells. 



Figure 3: Sketch of the application of the Green’s functions. 
The sources are in red (left) and the sampled field in green 
(right). The application uses the inner boundaries as proxies to 
perform the solve. 

The nested solver uses the inner boundaries as proxies to per¬ 
form the local solve inside the layer efficiently. The applica¬ 
tion of the Green’s integral can be decomposed in three steps. 
Using precomputed Green’s functions at each cell we evalu¬ 
ate the wavefield generated from the sources to form (from 
red to pink in Fig. [^left). This operation can be represented 
by a sparse block matrix M^. Then we solve Eq. to ob¬ 
tain (from pink to blue in Fig. [fright). Finally we use 
the Green’s representation formula to sample the wavefield at 
the interfaces (from blue to green in Fig. [^, this operation is 
represented by another sparse-block matrix M^. 

The algorithm described above leads to the factorization 

= (7) 

in which the blocks of and are dense, but highly com¬ 
pressible in PER form. 

Nested polarized traces 

To efficiently apply the Green’s integrals using Eq. we need 
to solve Eq. [^efficiently. A first possibility would be to use the 


Step 

^nodes 

Complexity per node 

EU factorizations 

0(P) 

o((Af/P + log(A?))3/2) 

Green’s functions 

0(P) 

o((Af/P + log(A?))2/2) 

Eocal solves 

0{P) 

0{N/P + \og{NY) 

Sweeps 

1 

0(P(A?/P + log(Af)2)“) 

Recombination 

0{P) 

0{N/P + \og{Nf) 


Table 1: Complexity of the different steps of the precondi¬ 
tioner, in which a depends on the compression of the local 
matrices, thus on the scaling of the frequency with respect to 
the number of unknowns. Typically a = 3/4. 

method of polarized traces in a recursive fashion. The result¬ 
ing recursive polarized traces method has empirically the same 
scalings as those found in |Zepeda-Nunez and DemanetlpOMI ) 
when the blocks are compressed in partitioned low rank (PER) 
form. However, the constants are much larger. 

A compressed-block LU solver 

A better alternative is to use a compressed-block EU solver to 
solve Eq. Given the banded structure of (see Fig. [^, we 
perform a block EU decomposition without pivoting (without 
fill-in), which leads to the factorization 

= m} (E) (E) 

Furthermore, the diagonal blocks of the EU factors are inverted 
explicitly so that the forward and backward substitutions are 
reduced to a series of matrix-vector products. Finally, the 
blocks are compressed in PER form, yielding a fast solve. This 
compressed direct strategy is not practical at the outer level be¬ 
cause of its prohibitive offline cost, but is very advantageous at 
the inner level. 

The online complexity is comparable with the complexity of 
the nested polarized traces, but with much smaller constants. 
The EU factorization has a ^ ^A^/^/P-hlog(A)^/^^ offline 
cost; which is comparable to the polarized traces method, but 
with much smaller constants. 

COMPLEXITY 

Table summarizes the complexities and number of proces¬ 
sors at each stage for both methods. For simplicity we do 
not count the logarithmic factors from the nested dissection; 
however, we consider the logarithmic factors coming from the 
extra degrees of freedom in the PME. 

If the frequency scales as ft) ~ the regime in which sec¬ 
ond order finite-differences are expected to be accurate, we 
observe a = 5/8; however, we assume the more conservative 
value a = 3/4. The latter is in better agreement with a the¬ 
oretical analysis of the rank of the off-diagonal blocks of the 
Green’s functions. In such scenario we have that the blocks 
ofM^andM^ can be compressed in PER form, resulting in a 

fast application in time, easily paral- 

lelizable among Lc nodes. Solving Eq. j^can be solved using 
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either the direct compressed or the nested polarized traces in 
^(Lc(n/L + \og{n))^/'^) time. This yields the previously ad¬ 
vertised runtime of 0’{Lc{n/L-\-\og{n)Y/^) for each applica¬ 
tion of the Green’s integral. 



N 

f[Hz] 

10x2 

40x8 

100 X 20 

88 X 425 

7.71 

(3) 0.89 

(3) 15.6 

(4) 97.9 

175 X 850 

11.1 

(3) 1.48 

(3) 17.7 

(3) 105 

350 X 1700 

15.9 

(3) 2.90 

(3) 22.1 

(4) 106 

700 X 3400 

22.3 

(3) 5.58 

(3)31.3 

(4) 126 

1400 X 6800 

31.7 

(3) 10.5 

(3) 47.9 

(4) 176 


Table 2: Number of GMRES iterations (bold) required to re¬ 
duce the relative residual to 10“^, along with average execu¬ 
tion time (in seconds) of one GMRES iteration using the com¬ 
pressed direct method, for different N and P = Lx Lc. The fre¬ 
quency is scaled such that f = (o/n y/n and the sound speed 
is given by the Marmousi2 model (see [Martin et al.|p006| )). 


To apply the Gauss-Seidel preconditioner we need ^(L) ap¬ 
plications of the Green’s integral, resulting in a runtime of 
^{L'Lc{n/L-\-\og{n))^/'^) to solve Eq. I Using the fact that 
L ~ ^/P and Lc ~ Vp and adding the contribution of the other 
steps of the online stage; we have that the overall online run¬ 
time is givenby^(P^/^A/'^/^+Plog(A/^)^/'^-hA/^/P + log(A/^)^), 
which is sublinear and of the form 0{N/P) up to logs, pro¬ 
vided that P = 



A/^/P-h log(A^)^) and the communication cost for the online 
part is ^(ny/P), which represents an asymptotic improvement 
with respect to |Zepeda-Nunez and Demanet| ( |2014|) , in which 
the storage and communication cost are ^(PN^ -\-N/P -\- 
log(A^)^) and ^(nP), respectively. 


NUMERICAL RESULTS 


Eig. [^depicts the fast convergence of the method. After a 
couple of iterations the exact and approximated solution are 
visually indistinguishable. 

Table shows the sublinear scaling for one GMRES iteration, 
with respect to the degrees of freedom in the volume. We 
can observe that the number of iterations to converge depends 
weakly on the frequency and the number of subdomains. Eig. 
I^shows the empirical scaling for one global GMRES iteration. 
We can observe that both methods have the same asymptotic 
runtime, but with different constants. 



Eigure 5: Runtime for one GMRES iteration using the two 
different nested solves, for L = 9 and = 3, and (O ^ 

DISCUSSION 

We presented an extension to |Zepeda-Nunez and Demanet| ( |2014| |, 
with improved asymptotic runtimes in a distributed memory 
environment. The method has sublinear runtime even in the 
presence of rough media of geophysical interests. Moreover, 
its performance is completely agnostic to the source. 

This algorithm is of especial interest in the context of 4D full 
wave form inversion, continuum reservoir monitoring, and lo¬ 
cal inversion. If the update to the model is localized, most 
of precomputations can be reused. The only extra cost is the 
refactorization and computation of the Green’s functions in the 
cells with a non null intersection with the support of the update, 
reducing sharply the computational cost. 

We point out that this approach can be further parallelized us¬ 
ing distributed algebra libraries. Moreover, the sweeps can be 
pipelined to maintain a constant load among all the nodes. 


Eigure 4: Two iteration of the preconditioner, from top to bot¬ 
tom: initial guess with only local solve; first iteration, second 
iteration, converged solution. 

Einally, the memory footprint is P\og{N)^^^ + 
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